HR ANALYTICS EMPLOYEE ATTRITION AND PERFORMANCE

BCon 147: special topics

Author

Danica S. Prias

Published

October 25, 2024

1 Project overview

In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.

2 Scenario

Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.

Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.

3 Understanding data source

The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.

This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.

#| label: variable descriptions

## the datatable function from the DT package creates an HTML widget display of the dataset
## install the DT package if it is not yet available in your R environment
readxl::read_excel("dataset/dataset-variable-description.xlsx") |> 
  DT::datatable()

4 Data wrangling and management

Libraries

Task: Load the necessary libraries

Before working with the dataset, load the libraries needed for data wrangling, analysis, and visualization. Packages that are not yet installed can be installed with the install.packages function. Some packages are introduced later in this project, so install them as needed and load them here.

#| label: libraries
# load all your libraries here

library(readr) 
library(readxl)
library(haven) 
library(tidyverse) 
library(tidytext) 
library(dplyr) 
library(ggplot2) 
library(skimr) 
library(magrittr) 
library(janitor) 
library(DT) 
library(GGally) 
library(reshape2) 
library(sjPlot) 
library(report) 
library(ggstatsplot) 
library(RColorBrewer) 
library(scales) 

4.1 Data importation

Task 4.1. Merging dataset
  • Import the two datasets Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.

  • Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more information about the left_join function here.

  • Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from DT package.

## import the two data here

## readr is already loaded in the libraries chunk above
## paths are relative to the project folder, matching the variable-description import

employee_dta <- read_csv("dataset/Employee.csv")

perf_rating_dta <- read_csv("dataset/PerformanceRating.csv")

## merge employee_dta and perf_rating_dta using left_join function.

merged_data <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")

## save the merged dataset as hr_perf_dta

hr_perf_dta <- merged_data

## Use the datatable from DT package to display the merged dataset

datatable(hr_perf_dta)
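
A point worth checking after the merge is how many rows each employee contributes. The sketch below uses two small made-up tables (the names employees and reviews and all values are illustrative assumptions, not the project files) to show that left_join keeps every employee, repeats an employee once per matching review, and fills the review columns with NA when there is no match.

```r
library(dplyr)

# Toy stand-ins for Employee.csv and PerformanceRating.csv (values are made up)
employees <- tibble(
  EmployeeID = c("A1", "B2", "C3"),
  Department = c("Sales", "Technology", "Sales")
)
reviews <- tibble(
  EmployeeID    = c("A1", "A1", "A1", "B2"),
  ManagerRating = c(3, 4, 5, 2)
)

# One row per (employee, review) pair; employees without a review are kept
merged <- left_join(employees, reviews, by = "EmployeeID")

nrow(merged)               # number of rows after the join
count(merged, EmployeeID)  # rows contributed by each employee
```

In the merged project data the same pattern is visible in the printed tibble further down, where one EmployeeID occupies several consecutive rows, so later row counts are counts of reviews rather than of distinct employees.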

4.2 Data management

Task 4.2. Standardizing variable names
  • Using the clean_names function from janitor package, standardize the variable names by using the recommended naming of variables.

  • Save the renamed variables as hr_perf_dta to update the dataset.

## clean names using the janitor packages and save as hr_perf_dta

hr_perf_dta
# A tibble: 6,899 × 33
   EmployeeID FirstName LastName Gender   Age BusinessTravel Department
   <chr>      <chr>     <chr>    <chr>  <dbl> <chr>          <chr>     
 1 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
 2 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
 3 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
 4 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
 5 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
 6 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
 7 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
 8 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
 9 3012-1A41  Leonelle  Simco    Female    30 Some Travel    Sales     
10 CBCB-9C9D  Leonerd   Aland    Male      38 Some Travel    Sales     
# ℹ 6,889 more rows
# ℹ 26 more variables: `DistanceFromHome (KM)` <dbl>, State <chr>,
#   Ethnicity <chr>, Education <dbl>, EducationField <chr>, JobRole <chr>,
#   MaritalStatus <chr>, Salary <dbl>, StockOptionLevel <dbl>, OverTime <chr>,
#   HireDate <chr>, Attrition <chr>, YearsAtCompany <dbl>,
#   YearsInMostRecentRole <dbl>, YearsSinceLastPromotion <dbl>,
#   YearsWithCurrManager <dbl>, PerformanceID <chr>, ReviewDate <chr>, …
hr_perf_dta <- hr_perf_dta %>% clean_names()

## display the renamed hr_perf_dta using datatable function

datatable(hr_perf_dta)
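
To see what clean_names does on its own, here is a minimal sketch on a one-row data frame. The two column names are taken from the dataset; the data frame name raw and the value 27 are made up for illustration.

```r
library(janitor)

# A tiny data frame with two of the original column names (values are made up)
raw <- data.frame(
  EmployeeID = "3012-1A41",
  `DistanceFromHome (KM)` = 27,
  check.names = FALSE
)

# clean_names converts the names to snake_case and drops the parentheses
cleaned <- clean_names(raw)
names(cleaned)
```

The resulting names should match the corresponding entries in the colnames(hr_perf_dta) output shown in the recoding task: employee_id and distance_from_home_km.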
Task 4.2. Recode data entries
  • Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.

  • Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.

  • Create new variables cat_work_life_balance, cat_self_rating, cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.

  • Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0, and Yes = 1.

  • Save all the changes in the hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.

## create cat_education

colnames(hr_perf_dta)
 [1] "employee_id"                        "first_name"                        
 [3] "last_name"                          "gender"                            
 [5] "age"                                "business_travel"                   
 [7] "department"                         "distance_from_home_km"             
 [9] "state"                              "ethnicity"                         
[11] "education"                          "education_field"                   
[13] "job_role"                           "marital_status"                    
[15] "salary"                             "stock_option_level"                
[17] "over_time"                          "hire_date"                         
[19] "attrition"                          "years_at_company"                  
[21] "years_in_most_recent_role"          "years_since_last_promotion"        
[23] "years_with_curr_manager"            "performance_id"                    
[25] "review_date"                        "environment_satisfaction"          
[27] "job_satisfaction"                   "relationship_satisfaction"         
[29] "training_opportunities_within_year" "training_opportunities_taken"      
[31] "work_life_balance"                  "self_rating"                       
[33] "manager_rating"                    
hr_perf_dta <- hr_perf_dta %>% mutate(cat_education = case_when(
    education == 1 ~ "No formal education",
    education == 2 ~ "High school",
    education == 3 ~ "Bachelor",
    education == 4 ~ "Masters",
    education == 5 ~ "Doctorate",
    TRUE ~ NA_character_
  ))

## create cat_envi_sat,  cat_job_sat, and cat_relation_sat

hr_perf_dta <- hr_perf_dta %>% mutate(cat_envi_sat = case_when(
    environment_satisfaction == 1 ~ "Very dissatisfied",
    environment_satisfaction == 2 ~ "Dissatisfied",
    environment_satisfaction == 3 ~ "Neutral",
    environment_satisfaction == 4 ~ "Satisfied",
    environment_satisfaction == 5 ~ "Very satisfied",
    TRUE ~ NA_character_
  )) %>%
  
  # recode job satisfaction
  
  mutate(cat_job_sat = case_when(
    job_satisfaction == 1 ~ "Very dissatisfied",
    job_satisfaction == 2 ~ "Dissatisfied",
    job_satisfaction == 3 ~ "Neutral",
    job_satisfaction == 4 ~ "Satisfied",
    job_satisfaction == 5 ~ "Very satisfied",
    TRUE ~ NA_character_
  )) %>%
  
  # recode relationship satisfaction
  
  mutate(cat_relation_sat = case_when(
    relationship_satisfaction == 1 ~ "Very dissatisfied",
    relationship_satisfaction == 2 ~ "Dissatisfied",
    relationship_satisfaction == 3 ~ "Neutral",
    relationship_satisfaction == 4 ~ "Satisfied",
    relationship_satisfaction == 5 ~ "Very satisfied",
    TRUE ~ NA_character_))
  
datatable(hr_perf_dta)
## create cat_work_life_balance, cat_self_rating, and cat_manager_rating

hr_perf_dta <- hr_perf_dta %>% mutate(cat_work_life_balance = case_when(
    work_life_balance == 1 ~ "Unacceptable",
    work_life_balance == 2 ~ "Needs improvement",
    work_life_balance == 3 ~ "Meets expectation",
    work_life_balance == 4 ~ "Exceeds expectation",
    work_life_balance == 5 ~ "Above and beyond",
    TRUE ~ NA_character_
  )) %>%
  
  # recode self-rating
  
  mutate(cat_self_rating = case_when(
    self_rating == 1 ~ "Unacceptable",
    self_rating == 2 ~ "Needs improvement",
    self_rating == 3 ~ "Meets expectation",
    self_rating == 4 ~ "Exceeds expectation",
    self_rating == 5 ~ "Above and beyond",
    TRUE ~ NA_character_
  )) %>%
  
  # recode manager rating
  
  mutate(cat_manager_rating = case_when(
    manager_rating == 1 ~ "Unacceptable",
    manager_rating == 2 ~ "Needs improvement",
    manager_rating == 3 ~ "Meets expectation",
    manager_rating == 4 ~ "Exceeds expectation",
    manager_rating == 5 ~ "Above and beyond",
    TRUE ~ NA_character_
  ))
  
datatable(hr_perf_dta)
## create bi_attrition

hr_perf_dta <- hr_perf_dta %>%
  mutate(bi_attrition = if_else(attrition == "Yes", 1, 0))
datatable(hr_perf_dta)
## print the updated hr_perf_dta using datatable function

datatable(hr_perf_dta)
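
The task asks for case_when, which is what the code above uses. As a side note, the same 1 to 5 recoding can be written once and reused, which avoids repeating the label list for every variable. The sketch below is one possible way to do that on made-up data; the helper recode_scale and the data frame toy are hypothetical names, not part of the project.

```r
library(dplyr)

# Shared label set for the satisfaction items
sat_labels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                "Satisfied", "Very satisfied")

# Hypothetical helper: map codes 1..length(labels) to labels;
# anything else (including NA) becomes NA, like TRUE ~ NA_character_
recode_scale <- function(x, labels) {
  as.character(factor(x, levels = seq_along(labels), labels = labels))
}

# Made-up ratings, including a missing value and an out-of-range code
toy <- tibble(
  job_satisfaction         = c(1, 3, 5, NA),
  environment_satisfaction = c(2, 4, 6, 1)
)

toy <- toy %>%
  mutate(
    cat_job_sat  = recode_scale(job_satisfaction, sat_labels),
    cat_envi_sat = recode_scale(environment_satisfaction, sat_labels)
  )

toy
```

Dropping the as.character call would keep the result as a factor with the levels in scale order, which can be convenient when plotting.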

5 Exploratory data analysis

5.1 Descriptive statistics of employee attrition

Task 5.1. Breakdown of attrition by key variables
  • Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.

  • Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate by job role, for example, group the dataset by job_role, use the count function to get the frequency of attrition for each job role, and divide it by the total number of observations in that role. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.

  • Plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!

## selecting attrition key variables and save as attrition_key_var_dta

attrition_key_var_dta <- hr_perf_dta %>%
  select(attrition, job_role, department, age, salary, job_satisfaction, work_life_balance)

## compute the attrition rate across job_role and save as attrition_rate_job_role (with pct_attrition)

attrition_rate_job_role <- attrition_key_var_dta %>%
  group_by(job_role) %>%
  summarise(attrition_count = sum(attrition == "Yes"), 
            total_count = n(),
            pct_attrition = attrition_count / total_count) %>%
  ungroup()

print(attrition_rate_job_role)
# A tibble: 13 × 4
   job_role                  attrition_count total_count pct_attrition
   <chr>                               <int>       <int>         <dbl>
 1 Analytics Manager                      28         213        0.131 
 2 Data Scientist                        597        1387        0.430 
 3 Engineering Manager                    18         307        0.0586
 4 HR Business Partner                     0          25        0     
 5 HR Executive                           29         119        0.244 
 6 HR Manager                              0          17        0     
 7 Machine Learning Engineer              95         582        0.163 
 8 Manager                                19         145        0.131 
 9 Recruiter                              86         152        0.566 
10 Sales Executive                       543        1567        0.347 
11 Sales Representative                  317         500        0.634 
12 Senior Software Engineer               84         512        0.164 
13 Software Engineer                     445        1373        0.324 
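
Because the merged data has one row per performance review, the rates above weight each employee by the number of reviews on file. The sketch below, on made-up data, contrasts that row-level rate with an employee-level rate computed after keeping one row per employee. It assumes employee_id is carried along in the selection, which the attrition_key_var_dta selection above does not do; the data frame toy and all values are illustrative only.

```r
library(dplyr)

# Made-up review-level data: A1 has three reviews, B2 one, C3 two
toy <- tibble(
  employee_id = c("A1", "A1", "A1", "B2", "C3", "C3"),
  job_role    = rep("Recruiter", 6),
  attrition   = c("No", "No", "No", "Yes", "Yes", "Yes")
)

# Row-level rate: every review counts once
row_level <- toy %>%
  group_by(job_role) %>%
  summarise(pct_attrition = mean(attrition == "Yes"), .groups = "drop")

# Employee-level rate: keep one row per employee before summarising
employee_level <- toy %>%
  distinct(employee_id, job_role, attrition) %>%
  group_by(job_role) %>%
  summarise(
    n_employees   = n(),
    pct_attrition = mean(attrition == "Yes"),
    .groups = "drop"
  )

row_level       # 3 of 6 rows are "Yes"
employee_level  # 2 of 3 employees are "Yes"
```

Which denominator is appropriate depends on the question being asked; the two only coincide when every employee has the same number of reviews.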
# plot attrition rate by job role

custom_fill <- "#5d5d3c"
highlight_color <- "#c7a252"

ggplot(attrition_rate_job_role, aes(x = reorder(job_role, -pct_attrition), y = pct_attrition)) +
  geom_bar(stat = "identity", aes(fill = pct_attrition > median(pct_attrition)),  
           show.legend = FALSE, width = 0.6) +  
  scale_fill_manual(values = c(custom_fill, highlight_color)) + 
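  # pct_attrition is a proportion here, so format it as a percentage for labels and axis
  geom_text(aes(label = scales::percent(pct_attrition, accuracy = 0.1)), hjust = 1.2, color = "white", size = 3) +  
  scale_y_continuous(labels = scales::percent) +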
  labs(title = "Attrition Rate by Job Role", x = "Job Role", y = "Attrition Rate (%)") +
  theme_minimal(base_size = 10) +  
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),  
    axis.title.x = element_text(size = 10, face = "bold"),
    axis.title.y = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(color = "#4d4d4d", size = 8),  
    axis.text.x = element_text(size = 8),
    plot.background = element_rect(fill = "#f5f5f5"),  
    panel.grid.major.x = element_line(color = "gray", linetype = "dashed"),  
    panel.grid.minor = element_blank(),
    plot.margin = unit(c(1, 1, 1, 2), "lines") 
  ) +
  coord_flip()

# compute attrition rate by department 

attrition_rate_department <- attrition_key_var_dta %>%
  group_by(department) %>%
  summarise(attrition_count = sum(attrition == "Yes"), 
            total_count = n(),
            pct_attrition = attrition_count / total_count) %>%
  ungroup()

print(attrition_rate_department)
# A tibble: 3 × 4
  department      attrition_count total_count pct_attrition
  <chr>                     <int>       <int>         <dbl>
1 Human Resources             115         313         0.367
2 Sales                       879        2211         0.398
3 Technology                 1267        4375         0.290
# plot attrition rate by department

custom_fill <- "#b7b7a4"
highlight_color <- "#9d956c"

median_pct_attrition <- median(attrition_rate_department$pct_attrition)

ggplot(attrition_rate_department, aes(y = reorder(department, pct_attrition), x = pct_attrition)) +
  geom_bar(stat = "identity", aes(fill = pct_attrition > median_pct_attrition),  
           show.legend = FALSE, width = 0.6) + 
  scale_fill_manual(values = c(custom_fill, highlight_color)) + 
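  # pct_attrition is a proportion here, so format it as a percentage for labels and axis
  geom_text(aes(label = scales::percent(pct_attrition, accuracy = 0.1)), 
            vjust = 5, color = "white", size = 3) +  
  scale_x_continuous(labels = scales::percent) +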
  labs(title = "Attrition Rate by Department", 
       x = "Attrition Rate (%)", 
       y = "Department") +
  theme_minimal(base_size = 10) +  
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),  
    axis.title.x = element_text(size = 10, face = "bold"),
    axis.title.y = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(color = "#4d4d4d", size = 8),  
    axis.text.x = element_text(size = 8),
    plot.background = element_rect(fill = "#f5f5f5"),  
    panel.grid.major.x = element_line(color = "gray", linetype = "dashed"),  
    panel.grid.minor = element_blank(),
    plot.margin = unit(c(1, 1, 1, 2), "lines")  
  ) +
  coord_flip() 

# compute attrition rate by age

attrition_rate_age <- attrition_key_var_dta %>%
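  # note: cut() uses right-closed intervals by default, so these bins are
  # (20,30], (30,40], ... rather than 20-29, 30-39, ...; ages of 20 or below
  # fall outside the breaks and end up in the NA group
  mutate(age_group = cut(age, breaks = c(20, 30, 40, 50, 60, 70), 
                         labels = c("20-29", "30-39", "40-49", "50-59", "60-69"))) %>%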
  group_by(age_group) %>%
  summarise(
    total_employees = n(),
    total_attrition = sum(attrition == "Yes", na.rm = TRUE), 
    pct_attrition = (total_attrition / total_employees) * 100
  ) %>%
  ungroup()

print(attrition_rate_age)
# A tibble: 5 × 4
  age_group total_employees total_attrition pct_attrition
  <fct>               <int>           <int>         <dbl>
1 20-29                3920            1722          43.9
2 30-39                1482             181          12.2
3 40-49                1170             133          11.4
4 50-59                   1               0           0  
5 <NA>                  326             225          69.0
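
The NA row in this table most likely comes from how cut() treats interval boundaries rather than from missing ages. The base R sketch below, on a handful of made-up ages, compares the default right-closed bins with left-closed bins that start at a lower break. The lower break of 18 is an assumption about the youngest age in the data and should be checked with min(age) before use.

```r
# Made-up ages around the interval boundaries
ages <- c(18, 20, 29, 30, 45, 51)

# Default behaviour: intervals are (20,30], (30,40], ...
# so 18 and 20 fall outside the breaks and 30 lands in the first bin
default_bins <- cut(ages, breaks = c(20, 30, 40, 50, 60, 70))

# Left-closed intervals [18,30), [30,40), ... match labels such as "30-39"
left_closed <- cut(
  ages,
  breaks = c(18, 30, 40, 50, 60, 70),
  right  = FALSE,
  labels = c("18-29", "30-39", "40-49", "50-59", "60-69")
)

data.frame(ages, default_bins, left_closed)
```

Applying the left-closed version to the project data would change the counts in the table above, so the table and plot would need to be regenerated.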
# plot attrition rate by age

custom_fill <- "#d7be82"
highlight_color <- "#6b6431"

median_pct_attrition <- median(attrition_rate_age$pct_attrition)

ggplot(attrition_rate_age, aes(x = reorder(age_group, -pct_attrition), y = pct_attrition)) +
  geom_bar(stat = "identity", aes(fill = pct_attrition > median_pct_attrition), 
           show.legend = FALSE, width = 0.6) + 
  scale_fill_manual(values = c(custom_fill, highlight_color)) + 
  geom_text(aes(label = paste0(round(pct_attrition, 1), "%")), 
            hjust = 1.2, color = "white", size = 3) +  
  labs(title = "Attrition Rate by Age Group", x = "Age Group", y = "Attrition Rate (%)") +
  theme_minimal(base_size = 10) +  
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),  
    axis.title.x = element_text(size = 10, face = "bold"),
    axis.title.y = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(color = "#4d4d4d", size = 8),  
    axis.text.x = element_text(size = 8),
    plot.background = element_rect(fill = "#f5f5f5"),  
    panel.grid.major.x = element_line(color = "gray", linetype = "dashed"),  
    panel.grid.minor = element_blank(),
    plot.margin = unit(c(1, 1, 1, 2), "lines")  
  ) +
  coord_flip()  

# compute attrition rate by salary 

attrition_rate_salary <- attrition_key_var_dta %>%
  mutate(salary_range = cut(salary, 
                             breaks = c(-Inf, 30000, 50000, 70000, 90000, Inf), 
                             labels = c("0-30k", "30k-50k", "50k-70k", "70k-90k", "90k+"))) %>%
  group_by(salary_range) %>%
  summarise(
    total_employees = n(),
    total_attrition = sum(attrition == "Yes", na.rm = TRUE),
    pct_attrition = (total_attrition / total_employees) * 100
  ) %>%
  ungroup()

print(attrition_rate_salary)
# A tibble: 5 × 4
  salary_range total_employees total_attrition pct_attrition
  <fct>                  <int>           <int>         <dbl>
1 0-30k                    673             425          63.2
2 30k-50k                 1469             686          46.7
3 50k-70k                 1095             384          35.1
4 70k-90k                  770             211          27.4
5 90k+                    2892             555          19.2
# plot attrition rate by salary

custom_fill_salary <- "#a1af8b"
highlight_color_salary <- "#768e6a"

median_pct_attrition_salary <- median(attrition_rate_salary$pct_attrition)

ggplot(attrition_rate_salary, aes(x = reorder(salary_range, pct_attrition), y = pct_attrition)) +
  geom_bar(stat = "identity", aes(fill = pct_attrition > median_pct_attrition_salary), 
           color = "white", width = 0.6, show.legend = FALSE) +  
  scale_fill_manual(values = c(custom_fill_salary, highlight_color_salary)) + 
  # labels sit above the bars, so use a dark color that shows on the light background
  geom_text(aes(label = paste0(round(pct_attrition, 1), "%")), 
            vjust = -0.5, color = "black", size = 3.5) +  
  labs(title = "Attrition Rate by Salary Range", 
       x = "Salary Range", 
       y = "Attrition Rate (%)") +
  ylim(0, 80) +  
  theme_minimal(base_size = 10) +  
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5, color = "black"),  
    axis.title.x = element_text(size = 10, face = "bold"),
    axis.title.y = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(color = "#4d4d4d", size = 8),  
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    plot.background = element_rect(fill = "#f5f5f5"),  
    panel.grid.major.x = element_line(color = "gray", linetype = "dashed"),  
    panel.grid.minor = element_blank(),
    plot.margin = unit(c(1, 1, 1, 2), "lines")  
  )

# compute attrition rate by job_satisfaction 

attrition_rate_job_satisfaction <- attrition_key_var_dta %>%
  group_by(job_satisfaction) %>%
  summarise(attrition_count = sum(attrition == "Yes"), 
            total_count = n(),
            pct_attrition = attrition_count / total_count) %>%
  ungroup()

print(attrition_rate_job_satisfaction)
# A tibble: 6 × 4
  job_satisfaction attrition_count total_count pct_attrition
             <dbl>           <int>       <int>         <dbl>
1                1              36         130         0.277
2                2             549        1674         0.328
3                3             568        1651         0.344
4                4             573        1685         0.340
5                5             535        1569         0.341
6               NA               0         190         0    
# plot attrition rate by job_satisfaction

attrition_rate_job_satisfaction$job_satisfaction_label <- factor(
  attrition_rate_job_satisfaction$job_satisfaction,
  # the scale has five codes; missing ratings stay NA instead of matching a sixth level
  levels = c(1, 2, 3, 4, 5),
  labels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied")
)

ggplot(attrition_rate_job_satisfaction, aes(x = job_satisfaction_label, y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "#696b5c", width = 0.6) +  
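  # pct_attrition is a proportion here, so format it as a percentage for labels and axis
  geom_text(aes(label = scales::percent(pct_attrition, accuracy = 0.1)), 
            position = position_stack(vjust = 0.5),  
            color = "white", size = 3) + 
  scale_y_continuous(labels = scales::percent) +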
  labs(title = "Attrition Rate by Job Satisfaction", x = "Job Satisfaction", y = "Attrition Rate (%)") +
  theme_minimal(base_size = 10) +  
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),  
    axis.title.x = element_text(size = 10, face = "bold"),
    axis.title.y = element_text(size = 10, face = "bold"),
    axis.text.x = element_text(size = 8),  
    axis.text.y = element_text(color = "#4d4d4d", size = 8),  
    plot.background = element_rect(fill = "#f5f5f5"),  
    panel.grid.major.x = element_line(color = "gray", linetype = "dashed"),  
    panel.grid.minor = element_blank(),
    plot.margin = unit(c(1, 1, 1, 1), "lines")  
  )

# compute attrition rate by work_life_balance 

attrition_rate_work_life_balance <- attrition_key_var_dta %>%
  group_by(work_life_balance) %>%
  summarise(attrition_count = sum(attrition == "Yes"), 
            total_count = n(),
            pct_attrition = attrition_count / total_count) %>%
  ungroup()

print(attrition_rate_work_life_balance)
# A tibble: 6 × 4
  work_life_balance attrition_count total_count pct_attrition
              <dbl>           <int>       <int>         <dbl>
1                 1              37         121         0.306
2                 2             568        1702         0.334
3                 3             580        1670         0.347
4                 4             560        1706         0.328
5                 5             516        1510         0.342
6                NA               0         190         0    
# plot attrition rate by work_life_balance

attrition_rate_work_life_balance <- attrition_rate_work_life_balance %>%
  mutate(work_life_balance = case_when(
    work_life_balance == 1 ~ "Unacceptable",
    work_life_balance == 2 ~ "Needs Improvement",
    work_life_balance == 3 ~ "Meets Expectations",
    work_life_balance == 4 ~ "Exceeds Expectations",
    work_life_balance == 5 ~ "Above and Beyond",
    TRUE ~ as.character(work_life_balance)  
  ))

ggplot(attrition_rate_work_life_balance, aes(x = reorder(work_life_balance, pct_attrition), y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "#b08968", width = 0.6) +  
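  # pct_attrition is a proportion here, so format it as a percentage for labels and axis
  geom_text(aes(label = scales::percent(pct_attrition, accuracy = 0.1)), 
            vjust = 5, color = "white", size = 3.5) +  
  scale_y_continuous(labels = scales::percent) +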
  labs(title = "Attrition Rate by Work-Life Balance", x = "Work-Life Balance", y = "Attrition Rate (%)") +
  theme_minimal(base_size = 10) + 
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),  
    axis.title.x = element_text(size = 10, face = "bold"),
    axis.title.y = element_text(size = 10, face = "bold"),
    axis.text.x = element_text(size = 9, angle = 45, hjust = 1),  
    axis.text.y = element_text(color = "#4d4d4d", size = 9),
    plot.background = element_rect(fill = "#f5f5f5"),  
    panel.grid.major.x = element_line(color = "gray", linetype = "dashed"),  
    panel.grid.minor = element_blank(),
    plot.margin = unit(c(1, 1, 1, 1), "lines")  
  )

5.2 Identifying attrition key drivers using correlation analysis

Task 5.2. Conduct a correlation analysis to identify key drivers
  • Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using the na.omit() before running the correlation analysis. Save the output in hr_corr.

  • Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and use the ggcorr function to visualize the correlation heatmap. You may explore this site for more information: ggcorr.

  • Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.

## conduct correlation of key variables. 

library(dplyr)

newhr_perf_dta <- hr_perf_dta %>%
  select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance) %>%
  na.omit()

str(newhr_perf_dta)
tibble [6,709 × 6] (S3: tbl_df/tbl/data.frame)
 $ bi_attrition     : num [1:6709] 0 0 0 0 0 0 0 0 0 0 ...
 $ salary           : num [1:6709] 102059 102059 102059 102059 102059 ...
 $ years_at_company : num [1:6709] 10 10 10 10 10 10 10 10 10 10 ...
 $ job_satisfaction : num [1:6709] 3 4 5 3 4 2 5 2 5 3 ...
 $ manager_rating   : num [1:6709] 3 2 5 4 3 4 3 4 4 4 ...
 $ work_life_balance: num [1:6709] 4 2 4 3 3 3 4 2 5 5 ...
 - attr(*, "na.action")= 'omit' Named int [1:190] 6422 6423 6424 6425 6426 6427 6428 6429 6439 6440 ...
  ..- attr(*, "names")= chr [1:190] "6422" "6423" "6424" "6425" ...
newhr_perf_dta$job_satisfaction <- as.numeric(as.character(newhr_perf_dta$job_satisfaction))
newhr_perf_dta$work_life_balance <- as.numeric(as.character(newhr_perf_dta$work_life_balance))

str(newhr_perf_dta)
tibble [6,709 × 6] (S3: tbl_df/tbl/data.frame)
 $ bi_attrition     : num [1:6709] 0 0 0 0 0 0 0 0 0 0 ...
 $ salary           : num [1:6709] 102059 102059 102059 102059 102059 ...
 $ years_at_company : num [1:6709] 10 10 10 10 10 10 10 10 10 10 ...
 $ job_satisfaction : num [1:6709] 3 4 5 3 4 2 5 2 5 3 ...
 $ manager_rating   : num [1:6709] 3 2 5 4 3 4 3 4 4 4 ...
 $ work_life_balance: num [1:6709] 4 2 4 3 3 3 4 2 5 5 ...
 - attr(*, "na.action")= 'omit' Named int [1:190] 6422 6423 6424 6425 6426 6427 6428 6429 6439 6440 ...
  ..- attr(*, "names")= chr [1:190] "6422" "6423" "6424" "6425" ...
hr_corr <- cor(newhr_perf_dta)
print(hr_corr)
                  bi_attrition       salary years_at_company job_satisfaction
bi_attrition       1.000000000 -0.211181478    -0.6896527798     0.0132368129
salary            -0.211181478  1.000000000     0.2206442116     0.0053054850
years_at_company  -0.689652780  0.220644212     1.0000000000     0.0008700583
job_satisfaction   0.013236813  0.005305485     0.0008700583     1.0000000000
manager_rating    -0.007654429 -0.001596736     0.0178656879    -0.0158205481
work_life_balance  0.003428836 -0.001517145     0.0079339508     0.0417242942
                  manager_rating work_life_balance
bi_attrition        -0.007654429       0.003428836
salary              -0.001596736      -0.001517145
years_at_company     0.017865688       0.007933951
job_satisfaction    -0.015820548       0.041724294
manager_rating       1.000000000       0.007996938
work_life_balance    0.007996938       1.000000000
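
When reading a matrix like this, it can help to pull out only the attrition row and sort it by absolute size. The sketch below does that on simulated data; the data frame toy, the seed, and the simulated relationships are assumptions for illustration and do not reproduce the project results. Since bi_attrition is coded 0/1, its Pearson correlation with a numeric variable is the point-biserial correlation.

```r
set.seed(147)
n <- 200

# Simulated data: attrition becomes less likely as tenure grows
years_at_company <- sample(0:10, n, replace = TRUE)
salary           <- 40000 + 3000 * years_at_company + rnorm(n, sd = 15000)
bi_attrition     <- rbinom(n, 1, plogis(2 - 0.6 * years_at_company))

toy <- data.frame(bi_attrition, salary, years_at_company)

# Full correlation matrix, then the attrition row without its own diagonal entry
toy_corr  <- cor(toy)
attr_corr <- toy_corr["bi_attrition", colnames(toy_corr) != "bi_attrition"]

# Rank the candidate drivers by strength of association, ignoring sign
attr_corr[order(abs(attr_corr), decreasing = TRUE)]
```

The same two lines applied to hr_corr would list the five candidate drivers from strongest to weakest association with bi_attrition.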
## Set CRAN repository for package installation

options(repos = c(CRAN = "https://cloud.r-project.org/"))

## install GGally package and use ggcorr function to visualize the correlation

if (!requireNamespace("GGally", quietly = TRUE)) install.packages("GGally")
if (!requireNamespace("reshape2", quietly = TRUE)) install.packages("reshape2")

library(GGally)
library(reshape2)

corr_matrix <- cor(newhr_perf_dta, use = "complete.obs")
melted_corr <- melt(corr_matrix)

ggplot(data = melted_corr, aes(Var1, Var2, fill = value)) + 
  geom_tile(color = "white") +  
  scale_fill_gradient2(low = "#d4ceba", high = "#ae8853", mid = "#f9f5f3", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +  
  labs(title = "Correlation Heatmap of Key Variables") +
  theme_minimal(base_size = 10) +  
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", color = "black"),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  
    axis.text.y = element_text(size = 8),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.position = "right",
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1)  # linewidth replaces size in ggplot2 3.4.0 and later
  ) +
  coord_fixed() 

ggsave("correlation_heatmap.png", width = 10, height = 8, dpi = 300)  
Discussion:

The correlation heatmap provides valuable insights into the relationships between key organizational variables and employee attrition. A closer examination reveals several notable patterns:

✦ Attrition and Years at Company: The strongest relationship observed is the negative correlation between attrition and years at the company (-0.69). This suggests that employees who have been with the organization for a longer period are substantially less likely to leave. The negative coefficient implies that as tenure increases, the likelihood of attrition decreases. This is in line with the common view that longer tenure tends to go with higher organizational commitment and job stability.

✦ Attrition and Salary: A weak negative correlation is found between attrition and salary (-0.21), indicating that employees with higher salaries are somewhat less likely to leave the organization. Although this relationship is much weaker than that between tenure and attrition, it still points to the importance of competitive compensation in retaining employees. The modest size of the coefficient suggests that salary alone is not a primary determinant of attrition but plays a supporting role.

✦ Other Variables and Attrition: Interestingly, other factors such as work-life balance, manager rating, and job satisfaction show little to no correlation with attrition, with coefficients close to zero. These results imply that, within this dataset, these factors do not directly influence employees’ decisions to stay or leave. This could be due to either the measurement of these variables or the possibility that other unexamined factors may overshadow their influence on attrition.

✦ Years at Company and Salary: A positive correlation between years at the company and salary (0.22) suggests that employees tend to receive higher compensation as their tenure increases. This reflects standard organizational practices, where tenure is often rewarded with incremental salary increases, highlighting the importance of retention strategies in talent management.

✦ Interrelationships Among Other Variables: The remaining variables, including job satisfaction, work-life balance, and manager rating, exhibit very weak correlations with each other, all below 0.05. This indicates that these variables operate relatively independently within the context of this dataset and do not significantly influence each other or overall employee attrition.

In summary, the results suggest that tenure and salary are the most influential factors in predicting employee attrition. Other aspects, such as work-life balance and managerial support, appear to have little direct impact in this particular analysis, though they may still play a role in broader employee satisfaction or engagement models.

5.3 Predictive modeling for attrition

Task 5.3. Predictive modeling for attrition
  • Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.

  • Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the documentation here on how to customize your model summary.

  • Also, use the plot_model function to visualize the model coefficients. You may read the documentation here on how to customize your model visualization.

  • Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.

## run a logistic regression model to predict employee attrition
## save the model as hr_attrition_glm_model

hr_attrition_glm_model <- glm(
  bi_attrition ~ salary + years_at_company + cat_job_sat + manager_rating + cat_work_life_balance,
  data = hr_perf_dta,
  family = binomial() )

## print the summary of the model using the summary function

summary(hr_attrition_glm_model)

Call:
glm(formula = bi_attrition ~ salary + years_at_company + cat_job_sat + 
    manager_rating + cat_work_life_balance, family = binomial(), 
    data = hr_perf_dta)

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
(Intercept)                               2.822e+00  1.854e-01  15.221   <2e-16
salary                                   -3.657e-06  4.097e-07  -8.927   <2e-16
years_at_company                         -6.354e-01  1.482e-02 -42.882   <2e-16
cat_job_satNeutral                        1.076e-01  1.039e-01   1.035   0.3006
cat_job_satSatisfied                      3.116e-02  1.046e-01   0.298   0.7658
cat_job_satVery dissatisfied             -4.904e-01  2.666e-01  -1.839   0.0659
cat_job_satVery satisfied                 5.597e-02  1.069e-01   0.524   0.6006
manager_rating                            4.029e-03  3.815e-02   0.106   0.9159
cat_work_life_balanceExceeds expectation -1.369e-01  1.067e-01  -1.283   0.1994
cat_work_life_balanceMeets expectation   -4.194e-02  1.063e-01  -0.394   0.6933
cat_work_life_balanceNeeds improvement   -3.790e-02  1.063e-01  -0.357   0.7214
cat_work_life_balanceUnacceptable        -6.146e-01  2.832e-01  -2.170   0.0300
                                            
(Intercept)                              ***
salary                                   ***
years_at_company                         ***
cat_job_satNeutral                          
cat_job_satSatisfied                        
cat_job_satVery dissatisfied             .  
cat_job_satVery satisfied                   
manager_rating                              
cat_work_life_balanceExceeds expectation    
cat_work_life_balanceMeets expectation      
cat_work_life_balanceNeeds improvement      
cat_work_life_balanceUnacceptable        *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8574.5  on 6708  degrees of freedom
Residual deviance: 4770.0  on 6697  degrees of freedom
  (190 observations deleted due to missingness)
AIC: 4794

Number of Fisher Scoring iterations: 5
## install sjPlot package and use tab_model function to display the summary of the model

if(!require(sjPlot)) install.packages("sjPlot"); library(sjPlot)
tab_model(hr_attrition_glm_model)
                                              bi attrition
Predictors                                    Odds Ratios   CI              p
(Intercept)                                   16.81         11.71 – 24.23   <0.001
salary                                        1.00          1.00 – 1.00     <0.001
years at company                              0.53          0.51 – 0.55     <0.001
cat job sat [Neutral]                         1.11          0.91 – 1.37     0.301
cat job sat [Satisfied]                       1.03          0.84 – 1.27     0.766
cat job sat [Very dissatisfied]               0.61          0.36 – 1.03     0.066
cat job sat [Very satisfied]                  1.06          0.86 – 1.30     0.601
manager rating                                1.00          0.93 – 1.08     0.916
cat work life balance [Exceeds expectation]   0.87          0.71 – 1.07     0.199
cat work life balance [Meets expectation]     0.96          0.78 – 1.18     0.693
cat work life balance [Needs improvement]     0.96          0.78 – 1.19     0.721
cat work life balance [Unacceptable]          0.54          0.31 – 0.94     0.030
Observations                                  6709
R2 Tjur                                       0.503
## use plot_model function to visualize the model coefficients

plot_model(hr_attrition_glm_model, 
           type = "est", 
           show.values = TRUE, 
           value.offset = 0.3, 
           title = "Model Coefficients for Employee Attrition",
           colors = c("#606c38", "#c7a252")) + 
  theme_bw() +  
  labs(title = "Model Coefficients for Employee Attrition", 
       x = "Variables", 
       y = "Estimates") +  
  theme(plot.title = element_text(hjust = 0.5, face = "bold", color = "black"),  
        axis.text.x = element_text(face = "bold", color = "black"),  
        plot.margin = unit(c(1, 1, 1, 1), "cm"))

Discussion:

The logistic regression model presents the coefficients for various predictors of employee attrition, illustrating the relative influence of different factors.

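✦ Salary is a statistically significant predictor (p < 0.001). Its odds ratio is displayed as 1.00 only because it is expressed per single unit of salary: the underlying coefficient is -3.657e-06, which corresponds to an odds ratio of about 0.96 for every additional 10,000 in salary, or roughly 3.6% lower odds of attrition. Higher salaries are therefore associated with a lower likelihood of attrition, although the effect per unit of pay is small. This aligns with the theory that employees tend to stay with companies where they are compensated well.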

✦ Years at the company is the strongest predictor in the model (z = -42.88, p < 0.001), with an odds ratio of 0.53. Each additional year of tenure is associated with about 47% lower odds of leaving, holding the other variables constant. This reflects a trend where employees develop stronger ties and loyalty to a company the longer they stay.

✦ Job satisfaction shows no statistically significant effect at the 0.05 level. The odds ratios below are relative to the omitted reference category (Dissatisfied):

  1. Employees who are very dissatisfied have an odds ratio of 0.61 (95% CI 0.36 to 1.03, p = 0.066). An odds ratio below 1 indicates lower, not higher, odds of leaving than the reference group, and the confidence interval includes 1, so the model does not provide evidence that extreme dissatisfaction drives turnover.

  2. Employees who are neutral, satisfied, or very satisfied have odds ratios close to 1.00 (1.11, 1.03, and 1.06), none of which is statistically significant. This suggests that the level of job satisfaction has minimal impact on attrition once salary and tenure are accounted for.

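✦ Work-life balance has a limited and somewhat counterintuitive effect. The odds ratios are relative to the reference category, which is the level omitted from the output:

  1. An unacceptable work-life balance has an odds ratio of 0.54 (95% CI 0.31 to 0.94, p = 0.030), which is significant at the 0.05 level (*) but not at the 0.001 level. Because the odds ratio is below 1, employees in this category have lower estimated odds of leaving than the reference group, not higher. This runs against the expectation that a poor work-life balance leads to burnout and turnover, so the result should be interpreted with caution.

  2. Other categories, such as needs improvement or meets expectations, show little to no effect (odds ratios between 0.87 and 0.96, none statistically significant). These results suggest that work-life balance does not have a strong influence on whether employees stay or leave in this model.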

✦ Manager rating has no significant effect on attrition, with an odds ratio of 1.00 (p = 0.916), indicating it is not a strong predictor in this model. This could suggest that employees do not base their decision to leave primarily on their direct manager’s rating, but rather on other factors such as compensation and tenure.

Overall, tenure and salary are the key predictors of employee attrition in this model. Job satisfaction and manager rating show no statistically significant effect, and the only significant work-life balance category (Unacceptable) is associated with lower, not higher, odds of leaving. These results suggest that compensation and early-tenure retention deserve the most attention, while the job satisfaction and work-life balance findings need further investigation before they can guide interventions.
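
The link between the coefficient table from summary() and the odds ratios from tab_model() can be made explicit with a short calculation. The sketch below re-enters three of the printed estimates by hand and exponentiates them. The object names (glm_estimates, or_salary_10k, pct_change_tenure) and the choice of a 10,000 salary step are illustrative assumptions, not part of the original analysis.

```r
# Estimates copied by hand from the summary() output above (log-odds scale)
glm_estimates <- c(
  salary           = -3.657e-06,
  years_at_company = -6.354e-01,
  manager_rating   =  4.029e-03
)

# Odds ratios are the exponentiated logistic regression coefficients
odds_ratios <- exp(glm_estimates)
print(round(odds_ratios, 2))

# Salary is measured in single currency units, so the per-unit odds ratio
# rounds to 1.00; rescale to a 10,000 increase (illustrative step size)
or_salary_10k <- exp(glm_estimates[["salary"]] * 10000)
print(round(or_salary_10k, 3))

# Percentage change in the odds of attrition per additional year of tenure
pct_change_tenure <- (odds_ratios[["years_at_company"]] - 1) * 100
print(round(pct_change_tenure, 1))
```

Rescaling salary in this way only changes how the effect is reported; the fitted model and its p-values are unaffected.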

5.4 Analysis of compensation and turnover

Task 5.4. Analyzing compensation and turnover
  • Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.

  • Install the report package and use the report function to generate a report of the t-test results.

  • Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.

  • Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.

  • Provide recommendations on whether revising compensation policies could be an effective retention strategy.

## compare the average monthly income of employees who left and those who stayed

sal_and_biAtt <- hr_perf_dta %>% select(bi_attrition, salary)
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = sal_and_biAtt)

## print the results of the t-test

print(attrition_ttest_results)

    Welch Two Sample t-test

data:  salary by bi_attrition
t = 18.869, df = 5524.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 38577.82 47523.18
sample estimates:
mean in group 0 mean in group 1 
      125007.26        81956.76 
## install the report package and use the report function to generate a report of the t-test results

if (!require(report)) install.packages("report"); library(report)
attrition_ttest_report <- report(attrition_ttest_results)

print(attrition_ttest_report)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.25e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43050.50, 95% CI [38577.82, 47523.18], t(5524.24) = 18.87, p < .001; Cohen's d
= 0.51, 95% CI [0.45, 0.56])
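
As a quick consistency check on the report, the 95% confidence interval can be approximately reconstructed from the printed group means, t statistic, and degrees of freedom. This is a minimal sketch using values typed in from the output above; the object names are hypothetical, and small differences from the printed interval are expected because the t statistic is rounded.

```r
# Values copied by hand from the printed Welch t-test output above
mean_stayed <- 125007.26   # mean salary, bi_attrition = 0
mean_left   <- 81956.76    # mean salary, bi_attrition = 1
t_stat      <- 18.869
df_welch    <- 5524.2

mean_diff <- mean_stayed - mean_left      # difference in group means
std_error <- mean_diff / t_stat           # standard error implied by t
t_crit    <- qt(0.975, df = df_welch)     # two-sided 95% critical value

# Confidence interval: difference plus or minus critical value times SE
ci_95 <- mean_diff + c(-1, 1) * t_crit * std_error
print(round(mean_diff, 2))
print(round(ci_95, 0))
```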
# install ggstatsplot package and use ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed

if (!require(ggstatsplot)) install.packages("ggstatsplot"); library(ggstatsplot)

if (!require(RColorBrewer)) install.packages("RColorBrewer")
library(RColorBrewer)

ggbetweenstats(
  data = hr_perf_dta,
  x = bi_attrition,
  y = salary,
  xlab = "Attrition Status",
  ylab = "Monthly Income",
  title = "Monthly Income Distribution by Attrition Status",
  ggtheme = ggplot2::theme_minimal(),
  pairwise.comparisons = TRUE,  
  palette = "Pastel2"  
)

# create histogram of salary for employees who left and those who stayed

if (!require(scales)) install.packages("scales"); library(scales)

salary_bins <- seq(0, 600000, by = 50000)

frequency_df <- hr_perf_dta %>%
  mutate(salary_bin = cut(salary, breaks = salary_bins, right = FALSE)) %>%
  group_by(salary_bin, bi_attrition) %>%
  summarise(Frequency = n(), .groups = 'drop')

print(frequency_df)
# A tibble: 20 × 3
   salary_bin      bi_attrition Frequency
   <fct>                  <dbl>     <int>
 1 [0,5e+04)                  0      1031
 2 [0,5e+04)                  1      1111
 3 [5e+04,1e+05)              0      1563
 4 [5e+04,1e+05)              1       642
 5 [1e+05,1.5e+05)            0       809
 6 [1e+05,1.5e+05)            1       208
 7 [1.5e+05,2e+05)            0       389
 8 [1.5e+05,2e+05)            1        87
 9 [2e+05,2.5e+05)            0       267
10 [2e+05,2.5e+05)            1        93
11 [2.5e+05,3e+05)            0       198
12 [2.5e+05,3e+05)            1        46
13 [3e+05,3.5e+05)            0       156
14 [3e+05,3.5e+05)            1        36
15 [3.5e+05,4e+05)            0        86
16 [3.5e+05,4e+05)            1        20
17 [4e+05,4.5e+05)            0        58
18 [4.5e+05,5e+05)            0        34
19 [5e+05,5.5e+05)            0        47
20 [5e+05,5.5e+05)            1        18
ggplot(hr_perf_dta, aes(x = salary, fill = factor(bi_attrition))) +
  geom_histogram(alpha = 0.6, position = "identity", breaks = seq(0, 600000, by = 50000)) +  
  scale_fill_manual(values = c("#926c15", "#efeee9"), labels = c("Stayed", "Left")) +
  labs(title = "Salary Distribution for Employees Who Stayed vs. Left", 
       x = "Salary", 
       y = "Count", 
       fill = "Attrition Status") +
  scale_x_continuous(limits = c(0, 600000), 
                     breaks = seq(0, 600000, by = 50000), 
                     labels = comma) +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

# create frequency polygon of salary for employees who left and those who stayed

ggplot(hr_perf_dta, aes(x = salary, color = factor(bi_attrition))) +
  geom_freqpoly(linewidth = 1.5, bins = 30) +  
  scale_color_manual(values = c("#d8cba7", "#5e6748"), 
                     labels = c("Stayed", "Left")) +  
  labs(title = "Frequency Polygon of Salary for Employees Who Stayed vs. Left", 
       x = "Salary", 
       y = "Count", 
       color = "Attrition Status") +
  scale_x_continuous(labels = comma) +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
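
The task also asks for the histogram and frequency polygon to be combined and faceted by bi_attrition. One possible way to do that is sketched below. It runs on simulated data (toy_salary, a hypothetical stand-in for hr_perf_dta), so the shapes it produces are not the real results; the colours and the 50,000 bin width simply mirror the plots above.

```r
library(ggplot2)

# Simulated stand-in for hr_perf_dta (hypothetical data, for illustration only)
set.seed(147)
toy_salary <- data.frame(
  bi_attrition = rep(c(0, 1), times = c(400, 200)),
  salary = c(rlnorm(400, meanlog = 11.5, sdlog = 0.6),
             rlnorm(200, meanlog = 11.0, sdlog = 0.6))
)

# Histogram with transparency plus a frequency polygon on the same bins,
# shown in one panel per attrition group
salary_facet_plot <- ggplot(toy_salary, aes(x = salary)) +
  geom_histogram(binwidth = 50000, boundary = 0, alpha = 0.5, fill = "#926c15") +
  geom_freqpoly(binwidth = 50000, boundary = 0, linewidth = 1, color = "#5e6748") +
  facet_wrap(~ bi_attrition, ncol = 1,
             labeller = as_labeller(c(`0` = "Stayed", `1` = "Left"))) +
  labs(title = "Salary Distribution by Attrition Status (simulated data)",
       x = "Salary",
       y = "Count") +
  theme_minimal()

print(salary_facet_plot)
```

Replacing toy_salary with hr_perf_dta should give the faceted view of the actual salaries, assuming the salary and bi_attrition columns are named as in the code above.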

Discussion:

The two visualizations present salary distributions for employees who stayed with the company versus those who left, providing insights into how salary influences employee attrition.

Histogram (Salary Distribution for Employees Who Stayed vs. Left):

✦ Employees who left the company tend to have lower salaries, with the largest concentration below 50,000 (1,111 employees) and a further 642 in the 50,000 to 100,000 range. The lowest bin is the only one in which leavers outnumber stayers (1,111 versus 1,031), which suggests that lower-paid employees are more likely to leave.

✦ In contrast, employees who stayed are spread more widely across the salary range, outnumbering leavers in every bin from 50,000 upward and reaching the 500,000 to 550,000 bin (the histogram axis extends to 600,000). The distribution shows that higher salaries are associated with greater employee retention.

✦ The overlap in lower salary ranges shows that some lower-paid employees still stayed, but overall, higher-paid employees are more likely to remain.

Frequency Polygon (Frequency Polygon of Salary for Employees Who Stayed vs. Left):

✦ This visualization reinforces the histogram’s trends. Employees who left the company show a sharp peak at the lower end of the salary scale, with a gradual decline as salary increases.

✦ For employees who stayed, the curve follows a similar trend at lower salary ranges but shows a longer tail at higher salaries, indicating that retention improves as salary increases.

✦ The frequency of employees who stayed remains higher than those who left at salaries above 100,000, further supporting the idea that higher salaries contribute to employee retention.

Conclusion:

Both visualizations clearly demonstrate that salary plays a crucial role in employee attrition. Employees with lower salaries are more likely to leave, while those with higher salaries are more likely to stay. These patterns highlight the importance of compensation as a critical factor in retention.
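
Because stayers outnumber leavers overall, raw counts can understate how much the attrition rate changes with salary. The sketch below converts the frequency table printed above into an attrition rate per salary bin. The counts are typed in from that output, bins with no leavers are entered as 0, and the data frame name salary_bin_counts and the bin labels are illustrative.

```r
# Counts copied by hand from the frequency_df output above
salary_bin_counts <- data.frame(
  salary_bin = c("0-50k", "50k-100k", "100k-150k", "150k-200k", "200k-250k",
                 "250k-300k", "300k-350k", "350k-400k", "400k-450k",
                 "450k-500k", "500k-550k"),
  stayed = c(1031, 1563, 809, 389, 267, 198, 156, 86, 58, 34, 47),
  left   = c(1111,  642, 208,  87,  93,  46,  36, 20,  0,  0, 18)
)

# Share of employees in each bin who left, in percent
salary_bin_counts$attrition_rate_pct <- round(
  100 * salary_bin_counts$left /
    (salary_bin_counts$stayed + salary_bin_counts$left),
  1
)

print(salary_bin_counts)
```

Expressed this way, a little over half of the employees in the lowest bin left, compared with under a third in the 50,000 to 100,000 bin.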

5.5 Employee satisfaction and performance analysis

Task 5.5. Analyzing employee satisfaction and performance
  • Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and count functions to calculate the average performance ratings for each group.

  • Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.

  • Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.

# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.

rating_manager_self_filtered <- hr_perf_dta %>%
  filter(!is.na(cat_manager_rating) & !is.na(cat_self_rating)) %>%  
  mutate( cat_manager_rating = as.numeric(factor(cat_manager_rating, levels = c("Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond"))),
    cat_self_rating = as.numeric(factor(cat_self_rating, levels = c("Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond")))
  )

average_ratings_filtered <- rating_manager_self_filtered %>%
  group_by(bi_attrition) %>%
  summarize(
    Average_ManagerRating = mean(cat_manager_rating, na.rm = TRUE),  
    Average_SelfRating = mean(cat_self_rating, na.rm = TRUE),      
  )

print(average_ratings_filtered)
# A tibble: 2 × 3
  bi_attrition Average_ManagerRating Average_SelfRating
         <dbl>                 <dbl>              <dbl>
1            0                  2.48               2.98
2            1                  2.46               2.99
# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.

library(ggplot2)

ggplot(rating_manager_self_filtered, aes(x = factor(cat_self_rating, 
                                      levels = c(1, 2, 3, 4),  
                                      labels = c("Needs improvement", "Meets expectation", 
                                                 "Exceeds expectation", "Above and beyond")), 
                                      fill = factor(bi_attrition))) +
  geom_bar(position = "dodge", alpha = 0.7) +  
  scale_fill_manual(values = c("#5d5d6c", "#c7a252"), 
                    labels = c("Stayed", "Left")) +  
  labs(title = "Distribution of Self Rating for Employees Who Stayed vs. Left", 
       x = "Self Rating", 
       y = "Count", 
       fill = "Attrition Status") +
  theme_minimal()

# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.

ggplot(rating_manager_self_filtered, aes(x = factor(cat_manager_rating, 
                                      levels = c(1, 2, 3, 4),  
                                      labels = c("Needs improvement", "Meets expectation", 
                                                 "Exceeds expectation", "Above and beyond")), 
                                      fill = factor(bi_attrition))) +
  geom_bar(position = "dodge", alpha = 0.7) +  
  scale_fill_manual(values = c("#7d916c", "#b7b7a4"), 
                    labels = c("Stayed", "Left")) + 
  labs(title = "Distribution of Manager Rating for Employees Who Stayed vs. Left", 
       x = "Manager Rating", 
       y = "Count", 
       fill = "Attrition Status") +
  theme_minimal()

# create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition.

ggplot(hr_perf_dta, aes(x = factor(cat_job_sat, 
                                    levels = c("Very dissatisfied", "Dissatisfied", 
                                               "Neutral", "Satisfied", "Very satisfied")), 
                         y = salary, 
                         fill = factor(bi_attrition))) +
  geom_boxplot(alpha = 0.7) +  
  scale_fill_manual(values = c("#d7be92", "#6b6431"), 
                    labels = c("Stayed", "Left")) +  
  labs(title = "Boxplot of Salary by Job Satisfaction and Attrition Status", 
       x = "Job Satisfaction", 
       y = "Salary", 
       fill = "Attrition Status") +
  theme_minimal()

Discussion:

The three visualizations provide valuable insights into the average performance ratings of employees and their relationships with job satisfaction, salary, and attrition, highlighting the complex dynamics involved in employee retention.

✦ Self-Rating Distribution for Employees Who Stayed vs. Left: In the Meets expectation, Exceeds expectation, and Above and beyond categories, the bars for employees who stayed are notably taller, although this partly reflects the fact that stayers make up about two-thirds of the workforce. A significant proportion of employees who left also gave themselves high ratings, and the average self-rating is almost identical in the two groups (2.98 for stayers and 2.99 for leavers). This suggests that self-perception of performance does not directly correlate with retention. Employees may still leave the organization despite positive self-assessments, indicating that other factors such as work environment, career growth opportunities, or external motivations may drive attrition.

✦ Manager Rating Distribution for Employees Who Stayed vs. Left: Manager ratings show a somewhat clearer visual pattern in relation to attrition. Employees rated as Meets expectation and Exceeds expectation by their managers appear more likely to stay, while those rated as Needs improvement appear more likely to leave. The difference is small, however: the average manager rating is 2.48 for stayers and 2.46 for leavers, and manager rating was not a significant predictor in the logistic regression (p = 0.916). Interestingly, even some employees rated as Above and beyond by managers left the organization, signaling that excellent performance alone does not ensure retention. This may indicate dissatisfaction with non-performance-related aspects, such as company culture or opportunities for advancement.

✦ Boxplot of Salary by Job Satisfaction and Attrition Status: The salary distribution shows that higher salaries generally align with higher levels of job satisfaction. Employees who stayed tended to have higher median salaries across most satisfaction categories, indicating that compensation may play a role in retention. However, some employees with high job satisfaction and competitive salaries still left the company, particularly in the Very satisfied category. This highlights that while salary is an important factor, it is not sufficient by itself to prevent attrition. Employees may value non-monetary aspects such as work environment, career development, and personal fulfillment.

These results indicate that employee retention is driven by multiple factors. Competitive salaries are associated with lower attrition, whereas average performance ratings barely differ between employees who stayed and those who left, and neither is sufficient on its own to guarantee that employees will stay. Non-monetary factors, such as work environment and career progression opportunities, likely play a crucial role in influencing decisions to leave, especially for high performers and satisfied employees. A comprehensive strategy that addresses both tangible and intangible aspects of the employee experience is essential for improving retention.

5.6 Work-life balance and retention strategies

Task 5.6. Analyzing work-life balance and retention strategies

At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:

## analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.

work_life_balance_summary <- hr_perf_dta %>%
  group_by(bi_attrition, work_life_balance) %>%
  summarise(count = n(), .groups = "drop")

print(work_life_balance_summary)
# A tibble: 11 × 3
   bi_attrition work_life_balance count
          <dbl>             <dbl> <int>
 1            0                 1    84
 2            0                 2  1134
 3            0                 3  1090
 4            0                 4  1146
 5            0                 5   994
 6            0                NA   190
 7            1                 1    37
 8            1                 2   568
 9            1                 3   580
10            1                 4   560
11            1                 5   516
## use visualizations to show the differences.

ggplot(hr_perf_dta, aes(x = factor(work_life_balance), fill = factor(bi_attrition))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Distribution of Work-Life Balance for Employees Who Stayed vs Left",
    x = "Work-Life Balance Rating",
    y = "Count",
    fill = "Attrition (0 = Stayed, 1 = Left)"
  ) +
  theme_minimal() + 
  scale_fill_manual(values = c("#8f857b", "lightgrey"))

Discussion:

The data indicates that there is no clear relationship between work-life balance ratings and employee attrition. Employees who stayed (Attrition = 0) and those who left (Attrition = 1) are fairly evenly distributed across the work-life balance ratings, which range from 1 (poor work-life balance) to 5 (excellent work-life balance).

Notably, employees with lower work-life balance ratings did not show a higher rate of attrition: the share of leavers is about 31% for a rating of 1 and 33% for a rating of 2, compared with 33% to 35% for ratings of 3 to 5. Across all categories, the majority of employees stayed, regardless of their work-life balance rating.

To further validate these observations, a Pearson’s Chi-squared test was conducted to assess whether a statistically significant relationship exists between work-life balance and attrition.

## assess whether employees with poor work-life balance are more likely to leave.

work_life_balance_attrition <- table(hr_perf_dta$cat_work_life_balance, hr_perf_dta$bi_attrition)
chi_sq_result <- chisq.test(work_life_balance_attrition)
print(chi_sq_result)

    Pearson's Chi-squared test

data:  work_life_balance_attrition
X-squared = 2.138, df = 4, p-value = 0.7104
Discussion:

The test results yielded a Chi-squared value of 2.138 with 4 degrees of freedom, and a p-value of 0.7104. Since this p-value is much greater than the typical significance threshold of 0.05, the results indicate no statistically significant association between work-life balance and attrition.

These findings suggest that work-life balance, as assessed by employees, is not a strong determinant of whether an employee stays or leaves the organization. Both the chart and the statistical analysis point to the conclusion that employees with lower work-life balance ratings are not more likely to leave than those with higher ratings. Therefore, work-life balance ratings do not appear to be a significant predictor of employee attrition in this dataset.
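
The attrition rates per rating and the Chi-squared statistic can be reproduced directly from the counts printed in work_life_balance_summary. The sketch below types those counts into a matrix, leaving out the 190 employees with a missing rating. The object names wlb_counts, attrition_rate_pct, and wlb_chi_sq are illustrative.

```r
# Counts copied by hand from work_life_balance_summary (NA row excluded)
wlb_counts <- matrix(
  c(84, 1134, 1090, 1146, 994,    # stayed (bi_attrition = 0)
    37,  568,  580,  560, 516),   # left   (bi_attrition = 1)
  nrow = 2, byrow = TRUE,
  dimnames = list(attrition = c("Stayed", "Left"),
                  work_life_balance = as.character(1:5))
)

# Attrition rate within each work-life balance rating, in percent
attrition_rate_pct <- round(100 * wlb_counts["Left", ] / colSums(wlb_counts), 1)
print(attrition_rate_pct)

# Pearson's Chi-squared test of independence on the 2 x 5 table
wlb_chi_sq <- chisq.test(wlb_counts)
print(wlb_chi_sq)
```

The attrition rate stays within a narrow band across all five ratings, which is why the test finds no significant association.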

5.7 Recommendations for HR interventions

Task 5.7. Recommendations for HR interventions

Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following questions as a guide for your recommendations and discussions.

  • What are the key factors contributing to employee attrition in the company?
Discussion:

The analysis identified several factors contributing to employee attrition:

✦ Years at Company: There is a strong negative correlation between the number of years an employee has been with the company and their likelihood of leaving (-0.69). Employees who have been with the organization for longer periods tend to have a higher level of loyalty and organizational commitment, making them less likely to seek employment elsewhere. This trend aligns with existing research suggesting that tenure often correlates with job stability and increased engagement.

✦ Salary: Salary also plays a role in employee retention, with a weak negative correlation (-0.21) and a statistically significant effect in the logistic regression (p < 0.001). Higher salaries are associated with a reduced likelihood of attrition. However, while competitive compensation is a key factor, it is not the sole determinant of whether employees stay or leave. The modest correlation indicates that salary must be addressed alongside other non-monetary factors to have a meaningful impact.

✦ Job Satisfaction: Job satisfaction shows little measurable relationship with attrition in this dataset. Its correlation with attrition is close to zero, and none of the job satisfaction categories is statistically significant in the logistic regression. The Very dissatisfied category has an odds ratio of 0.61 (p = 0.066), which points to lower rather than higher odds of leaving relative to the reference category and does not reach significance at the 0.05 level. Dissatisfaction may still matter for engagement, but this analysis does not establish it as a driver of turnover.

✦ Work-Life Balance: The evidence on work-life balance is mixed. The Chi-squared test found no significant association with attrition (p = 0.7104), and attrition rates are similar across all five ratings. In the logistic regression, only the Unacceptable category is significant (odds ratio 0.54, p = 0.030), and its direction indicates lower, not higher, odds of leaving relative to the reference category. Work-life balance therefore does not emerge as a clear driver of attrition in this dataset.

  • Which factors are most strongly correlated with attrition?
Discussion:

The factors most strongly correlated with employee attrition are:

✦ Tenure (Years at Company): A strong negative correlation of -0.69 indicates that the longer employees remain with the company, the less likely they are to leave. This suggests that retaining employees through their early years with the company is critical, as once employees have developed tenure, they are more likely to stay.

✦ Salary: While the correlation is weak (-0.21), salary is still a statistically significant factor in retention. Employees with higher salaries are less likely to leave, reinforcing the importance of competitive compensation in maintaining employee loyalty.

✦ Job Satisfaction: Correlations between job satisfaction and attrition are close to zero, and the Very dissatisfied category in the logistic regression (odds ratio 0.61, p = 0.066) is not statistically significant. Job satisfaction is therefore not strongly correlated with attrition in this dataset.

✦ Work-Life Balance: Work-life balance is also only weakly related to attrition. Its correlation with attrition is close to zero, the Chi-squared test is not significant (p = 0.7104), and the one significant category in the logistic regression (Unacceptable, odds ratio 0.54) is associated with lower rather than higher odds of leaving relative to the reference category.

  • What strategies could be implemented to improve employee retention and satisfaction?
Discussion:

To address the factors contributing to employee attrition, the following strategies should be considered:

✦ Enhanced Compensation and Benefits: Given the moderate yet significant correlation between salary and attrition, it is recommended that the company regularly review its compensation structure to ensure that it remains competitive. Salary increases should particularly target employees in the early stages of their tenure, as this can help reduce early-stage attrition. Offering performance-based bonuses or additional benefits, such as retirement plans, can further strengthen retention efforts.

✦ Career Development and Advancement Opportunities: Since tenure is a key predictor of retention, providing clear career growth paths and advancement opportunities can enhance employee commitment to the company. Initiatives such as mentorship programs, leadership development, and upskilling opportunities can encourage employees to envision a long-term career within the organization.

✦ Improving Job Satisfaction: To mitigate the high risk of attrition among dissatisfied employees, the company should actively monitor and address factors that contribute to dissatisfaction. Regular employee engagement surveys can help identify areas of concern, and targeted interventions—such as improvements to work environments, recognition programs, or enhanced communication from leadership—can boost satisfaction levels.

✦ Promoting Work-Life Balance: Employees who experience an unacceptable work-life balance are at higher risk of leaving the company. HR can introduce flexible working arrangements, such as remote work options or flexible hours, to help employees better manage their personal and professional lives. Additionally, promoting the use of vacation time and reducing excessive workloads can further enhance work-life balance.

  • How can HR leverage the insights from the analysis to develop effective retention strategies?
Discussion:

HR departments can utilize the insights from the analysis to develop targeted retention strategies that address the most significant drivers of employee turnover:

✦ Proactive Salary Adjustments: By using salary data, HR can identify employees at risk of leaving due to compensation concerns and offer competitive raises or performance-based bonuses to encourage retention. Compensation reviews can be particularly useful for employees in their early years at the company, as increasing their salary may enhance loyalty.

✦ Focus on Early Retention: Since tenure is strongly correlated with retention, HR should focus on strategies that promote early engagement and retention. This could include structured onboarding programs, mentorship for new hires, and opportunities for rapid advancement within the first few years of employment.

✦ Continuous Monitoring of Job Satisfaction: HR should implement regular pulse surveys to gauge employee satisfaction and address emerging concerns. Employees reporting high levels of dissatisfaction should be engaged with tailored interventions, such as career coaching, additional training, or workload adjustments.

✦ Flexible Work Arrangements: Offering flexible schedules and remote work options can significantly improve work-life balance, reducing the risk of burnout and attrition. HR should assess employee preferences and incorporate flexibility into the company’s work culture.
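
One way HR might put the four points above into practice is a simple rule-based screen that flags employees who match several risk indicators at once. The sketch below is an assumption-laden illustration rather than a validated model: the field names, the two-year tenure cutoff, the 1-to-4 rating scales (with 1 as the lowest rating), and the "at least two flags" rule are all hypothetical choices that would need tuning against real data.

```python
from statistics import median


def flag_at_risk(records, max_tenure_years=2, min_flags=2):
    """Return (employee_id, reasons) for employees matching >= min_flags rules.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    salary_median = median(r["monthly_salary"] for r in records)
    flagged = []
    for r in records:
        reasons = []
        if r["years_at_company"] <= max_tenure_years:
            reasons.append("early tenure")
        if r["monthly_salary"] < salary_median:
            reasons.append("below-median salary")
        if r["job_satisfaction"] == 1:  # assumed scale: 1 = lowest rating
            reasons.append("very dissatisfied")
        if r["work_life_balance"] == 1:  # assumed scale: 1 = lowest rating
            reasons.append("unacceptable work-life balance")
        if len(reasons) >= min_flags:
            flagged.append((r["employee_id"], reasons))
    return flagged


# Hypothetical toy records, not rows from the project dataset.
staff = [
    {"employee_id": 101, "years_at_company": 1, "monthly_salary": 3000,
     "job_satisfaction": 1, "work_life_balance": 3},
    {"employee_id": 102, "years_at_company": 8, "monthly_salary": 6000,
     "job_satisfaction": 4, "work_life_balance": 3},
    {"employee_id": 103, "years_at_company": 2, "monthly_salary": 5000,
     "job_satisfaction": 3, "work_life_balance": 1},
    {"employee_id": 104, "years_at_company": 5, "monthly_salary": 4000,
     "job_satisfaction": 3, "work_life_balance": 3},
]

at_risk = flag_at_risk(staff)
for employee_id, reasons in at_risk:
    print(employee_id, ", ".join(reasons))
```

Returning the reasons alongside each ID lets HR match the intervention to the cause, for example a compensation review for a salary flag or a workload adjustment for a work-life balance flag.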

  • What are the potential benefits of implementing these strategies for the company?
Discussion:

By implementing the recommended HR interventions, the company stands to gain several benefits:

✦ Reduced Employee Turnover: Addressing key factors such as salary, job dissatisfaction, and work-life balance will help decrease attrition rates, leading to a more stable and committed workforce.

✦ Enhanced Employee Satisfaction and Engagement: Employees who feel supported in terms of compensation, career development, and work-life balance are more likely to be engaged in their roles, which can lead to higher productivity and improved overall performance.

✦ Cost Savings: Reducing attrition will lower recruitment, training, and onboarding costs associated with replacing employees. Retaining skilled workers reduces the disruption caused by turnover and ensures continuity within teams.

✦ Improved Organizational Reputation: A company that demonstrates a strong commitment to employee satisfaction and well-being is better positioned to attract top talent and to strengthen its reputation as an employer of choice in the industry.
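
The cost-savings benefit can be made concrete with a back-of-the-envelope calculation: the number of departures avoided, multiplied by the cost of replacing one employee. The sketch below assumes a flat replacement cost per departure, and every input in the usage example (headcount, attrition rates, cost per replacement) is a hypothetical placeholder, not a figure from this analysis.

```python
def annual_turnover_savings(headcount, rate_before, rate_after,
                            cost_per_replacement):
    """Estimate yearly savings from a lower attrition rate.

    Assumes one flat cost per replaced employee covering recruitment,
    training, and onboarding (a simplifying assumption).
    """
    departures_avoided = headcount * (rate_before - rate_after)
    return departures_avoided * cost_per_replacement


# Hypothetical inputs for illustration only.
savings = annual_turnover_savings(
    headcount=1000,
    rate_before=0.20,
    rate_after=0.15,
    cost_per_replacement=10_000,
)
print(f"Estimated annual savings: {savings:,.0f}")
```

A fuller estimate would also account for the cost of the interventions themselves and for productivity lost while a role is vacant, both of which this sketch leaves out.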

The findings from the analysis offer crucial insights into the factors influencing employee attrition and dissatisfaction. Implementing a strategic mix of compensation reviews, career development initiatives, and work-life balance enhancements will enable the company to cultivate a more engaged, satisfied, and stable workforce. These targeted interventions, backed by data-driven insights, will contribute to the company’s sustainable growth while promoting a positive and productive work environment.